Asymptotic Properties of Bayesian Predictive
Densities When the Distributions of Data and 
Target Variables are Different 


Fumiyasu Komaki


Abstract. Bayesian predictive densities when the observed data $x$ and the target
variable $y$ to be predicted have different distributions are investigated by using
the framework of information geometry. The performance of predictive densities
is evaluated by the Kullback-Leibler divergence. The parametric models are
formulated as Riemannian manifolds. In the conventional setting in which $x$ and
$y$ have the same distribution, the Fisher-Rao metric and the Jeffreys prior play
essential roles. In the present setting in which $x$ and $y$ have different distributions,
a new metric, which we call the predictive metric, constructed by using the Fisher
information matrices of $x$ and $y$, and the volume element based on the predictive
metric play the corresponding roles. It is shown that Bayesian predictive densities
based on priors constructed by using non-constant positive superharmonic
functions with respect to the predictive metric asymptotically dominate those based
on the volume element prior of the predictive metric.

Keywords: differential geometry, Fisher-Rao metric, Jeffreys prior, Kullback- 
Leibler divergence, predictive metric. 


1 Introduction 

Suppose that we have independent observations $x(1), x(2), \ldots, x(N)$ from a probability
density $p(x \mid \theta)$ that belongs to a parametric model $\{p(x \mid \theta) \mid \theta \in \Theta\}$, where
$\theta = (\theta^1, \theta^2, \ldots, \theta^d)$ is an unknown $d$-dimensional parameter and $\Theta$ is the parameter space.
The random variable $y$ to be predicted is independently distributed according to a
density $\tilde{p}(y \mid \theta)$ in a parametric model $\{\tilde{p}(y \mid \theta) \mid \theta \in \Theta\}$, possibly different from
$\{p(x \mid \theta) \mid \theta \in \Theta\}$, with the same parameter $\theta$. The objective is to construct a predictive
density $\hat{p}(y; x^N)$ for $y$ by using $x^N := (x(1), \ldots, x(N))$. The performance of $\hat{p}(y; x^N)$ is
evaluated by the Kullback-Leibler divergence

$$D(\tilde{p}(y \mid \theta), \hat{p}(y; x^N)) := \int \tilde{p}(y \mid \theta) \log \frac{\tilde{p}(y \mid \theta)}{\hat{p}(y; x^N)} \, dy$$

from the true density $\tilde{p}(y \mid \theta)$ to the predictive density $\hat{p}(y; x^N)$. The risk function is
given by

$$E\bigl[ D(\tilde{p}(y \mid \theta), \hat{p}(y; x^N)) \bigm| \theta \bigr] = \iint p(x^N \mid \theta)\, \tilde{p}(y \mid \theta) \log \frac{\tilde{p}(y \mid \theta)}{\hat{p}(y; x^N)} \, dy \, dx^N.$$

It is widely recognized that plug-in densities $\tilde{p}(y \mid \hat{\theta})$ constructed by replacing the
unknown parameter $\theta$ by an estimate $\hat{\theta}(x^N)$ may not perform very well and that Bayesian
predictive densities

$$p_\pi(y \mid x^N) := \frac{\int \tilde{p}(y \mid \theta)\, p(x^N \mid \theta)\, \pi(\theta) \, d\theta}{\int p(x^N \mid \theta)\, \pi(\theta) \, d\theta}$$

constructed by using a prior $\pi$ perform better than plug-in densities. If the value of $\theta$
is given, there is no point in considering the conditional density of $y$ given
$x^N$ since the obvious relation $p(y \mid x^N, \theta) = \tilde{p}(y \mid \theta)$ holds. However, if $\theta$ is unknown,
Bayesian predictive densities $p_\pi(y \mid x^N)$ constructed by introducing a prior density $\pi(\theta)$
on the parameter space are useful to approximate the true density $\tilde{p}(y \mid \theta)$ as discussed
in Aitchison and Dunsmore (1975) and Geisser (1993). In fact, there exists a predictive
density whose asymptotic risk is smaller than that of a plug-in density unless the mean
mixture curvature of the model manifold vanishes, see Komaki (1996) and Hartigan
(1998) for details. The choice of $\pi$ becomes important especially when the sample size
$N$ is not very large. Although the Jeffreys prior is a widely known default prior, it does
not perform satisfactorily, especially when the unknown parameter is multidimensional,
as Jeffreys himself pointed out.

Komaki (2001) constructed a Bayesian predictive density incorporating the
advantage of shrinkage methods for the multivariate normal model. See also George et al.
(2006) for useful results for the normal model.

In the conventional setting in which the distributions of $x(i)$, $i = 1, \ldots, N$, and
$y$ are the same, asymptotic theory of prediction based on general parametric models
has been studied by using the framework of information geometry, see Komaki (1996).
In information geometry, a parametric statistical model is regarded as a differentiable
manifold, which we call the model manifold, and the parameter space is regarded as
a coordinate system of the manifold, see Amari (1985). The Fisher-Rao metric is a
Riemannian metric based on the Fisher information matrix on the model manifold. The
Jeffreys prior $\pi_J(\theta)$ corresponds to the volume element of the model manifold associated
with the Fisher-Rao metric. When the distributions of $x(i)$, $i = 1, \ldots, N$, and $y$ are the
same, the asymptotic difference between the risks of $p_\pi(y \mid x^N)$ and $p_J(y \mid x^N)$ is given
by

$$N^2 \Bigl[ E\{D(p(y \mid \theta), p_\pi(y \mid x^N)) \mid \theta\} - E\{D(p(y \mid \theta), p_J(y \mid x^N)) \mid \theta\} \Bigr]$$
$$= \frac{\Delta(\pi/\pi_J)}{\pi/\pi_J} - \frac{1}{2} \sum_{i=1}^{d} \sum_{j=1}^{d} g^{ij}\, \partial_i \log\frac{\pi}{\pi_J}\, \partial_j \log\frac{\pi}{\pi_J} + o(1) = 2\, \frac{\Delta (\pi/\pi_J)^{1/2}}{(\pi/\pi_J)^{1/2}} + o(1), \tag{1}$$


where $\partial_i$ denotes $\partial/\partial\theta^i$, $g_{ij} := E\{\partial_i \log p(x \mid \theta)\, \partial_j \log p(x \mid \theta) \mid \theta\}$, $g^{ij}$ denotes the $(i,j)$-element
of the inverse of the $d \times d$ matrix $(g_{ij})$, and $\Delta$ is the Laplacian, see Komaki
(2006). The Laplacian $\Delta$ on a Riemannian manifold endowed with a metric $g_{ij}$ is defined
by

$$\Delta f := |g|^{-1/2} \sum_i \sum_j \partial_i \bigl( |g|^{1/2} g^{ij} \partial_j f \bigr) = \sum_i \sum_j \nabla^{(0)}_i \bigl( g^{ij} \partial_j f \bigr), \tag{2}$$

where $|g|$ is the determinant of the $d \times d$ matrix $(g_{ij})$, $f$ is a smooth real function on
$\Theta$, and $\nabla^{(0)}_i$ denotes the covariant derivative, defined in the next section. The indices
$i, j, k, \ldots$ run from 1 to $d$. Note that both the definition (2) of the Laplacian and the
definition $\Delta f = -|g|^{-1/2} \sum_i \sum_j \partial_i ( |g|^{1/2} g^{ij} \partial_j f )$, which differs in sign, are widely adopted
in the mathematics literature, which can be confusing. Because of (1), if there exists
a non-constant positive superharmonic function $f$, i.e. a non-constant positive function
satisfying $\Delta f \le 0$ for every $\theta$, on the model manifold, then the Bayesian predictive
density based on the prior density defined by $\pi = f \pi_J$ asymptotically dominates that
based on the Jeffreys prior. Here, the Riemannian geometric structure of the model
manifold based on the Fisher-Rao metric plays a fundamental role.

In practical applications, it often occurs that observed data $x(i)$, $i = 1, \ldots, N$, and
the target variable $y$ to be predicted have different distributions. Regression models are
a typical example. Suppose that we observe $x = W\theta + \varepsilon$, where $W$ is a given $n \times d$ matrix
($n > d$), and predict $y = Z\theta + \tilde{\varepsilon}$, where $Z$ is a given $m \times d$ matrix and $\theta = (\theta^1, \ldots, \theta^d)^\top$ is
an unknown parameter. Then, the Fisher information matrices for the same parameter
$\theta$ based on $p(x \mid \theta)$ and $\tilde{p}(y \mid \theta)$ are different. Similar situations also occur in nonlinear
regression problems. Kobayashi and Komaki (2008) and George and Xu (2008) showed
that shrinkage priors are useful for constructing Bayesian predictive densities for linear
regression models when the observations are normally distributed with known variance.
However, it has been difficult to construct useful priors for general models other than
the normal models when $x$ and $y$ have different distributions.

In the present paper, we study asymptotic theory for the setting in which $x(i)$, $i =
1, \ldots, N$, and $y$ have different distributions. Although several asymptotic properties of
predictive distributions for such a setting are studied by Fushiki et al. (2004), the result
corresponding to (1) has not been explored. The generalization is not straightforward
because two different differential geometric structures, such as the Fisher-Rao metrics,
one for $p(x \mid \theta)$ and the other for $\tilde{p}(y \mid \theta)$, exist in the present setting.

We introduce a new metric $\mathring{g}_{ij}$, which we call the predictive metric, depending
on both $p(x \mid \theta)$ and $\tilde{p}(y \mid \theta)$. The predictive metric $\mathring{g}_{ij}$ and its volume element
$|\mathring{g}_{ij}|^{1/2} d\theta^1 \cdots d\theta^d$ correspond to the Fisher-Rao metric and the Jeffreys prior in
the conventional setting.

In Section 2, we obtain an expansion of the difference of the risk functions of Bayesian
predictive densities. Each term in the expansion is represented by using geometrical
quantities and is invariant with respect to parameter transformations. In Section 3, we
introduce the predictive metric $\mathring{g}_{ij}$ and evaluate the asymptotic risk difference between
a Bayesian predictive density based on a prior $\pi$ and that based on the volume element
prior of the predictive metric $\mathring{g}_{ij}$. The asymptotic risk difference is
represented by using the Laplacian associated with the predictive metric. In Section 4,
we consider three examples and construct superior priors by using the formula obtained
in Section 3.


2 An expansion of the risk of predictive densities 

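First, we introduce the information geometrical notation to be used. In the following,
the quantities associated with the model $\{p(x \mid \theta) \mid \theta \in \Theta\}$ are denoted without tilde,
and those associated with the model $\{\tilde{p}(y \mid \theta) \mid \theta \in \Theta\}$ are denoted with tilde. We put
$l := \log p(x \mid \theta)$ and $\tilde{l} := \log \tilde{p}(y \mid \theta)$. The Fisher-Rao metrics on the model manifolds
$\{p(x \mid \theta) \mid \theta \in \Theta\}$ and $\{\tilde{p}(y \mid \theta) \mid \theta \in \Theta\}$ are given by

$$g_{ij}(\theta) := E(\partial_i l\, \partial_j l \mid \theta) \quad\text{and}\quad \tilde{g}_{ij}(\theta) := E(\partial_i \tilde{l}\, \partial_j \tilde{l} \mid \theta),$$

respectively. The $(i,j)$-elements of the inverses of the $d \times d$ matrices $(g_{ij})$ and $(\tilde{g}_{ij})$ are
denoted by $g^{ij}$ and $\tilde{g}^{ij}$, respectively. We define

$$T_{ijk}(\theta) := E(\partial_i l\, \partial_j l\, \partial_k l \mid \theta), \qquad \tilde{T}_{ijk}(\theta) := E(\partial_i \tilde{l}\, \partial_j \tilde{l}\, \partial_k \tilde{l} \mid \theta),$$

$$\Gamma^{(e)}_{ijk}(\theta) := E(\partial_i \partial_j l\, \partial_k l \mid \theta), \qquad \tilde{\Gamma}^{(e)}_{ijk}(\theta) := E(\partial_i \partial_j \tilde{l}\, \partial_k \tilde{l} \mid \theta),$$

$$\Gamma^{(m)}_{ijk}(\theta) := E(\partial_i \partial_j l\, \partial_k l \mid \theta) + T_{ijk}(\theta), \qquad \tilde{\Gamma}^{(m)}_{ijk}(\theta) := E(\partial_i \partial_j \tilde{l}\, \partial_k \tilde{l} \mid \theta) + \tilde{T}_{ijk}(\theta),$$

$$\Gamma^{(0)}_{ijk}(\theta) := \frac{1}{2} \bigl\{ \Gamma^{(e)}_{ijk}(\theta) + \Gamma^{(m)}_{ijk}(\theta) \bigr\} = \frac{1}{2} \bigl\{ \partial_i g_{jk}(\theta) + \partial_j g_{ki}(\theta) - \partial_k g_{ij}(\theta) \bigr\},$$

$$\tilde{\Gamma}^{(0)}_{ijk}(\theta) := \frac{1}{2} \bigl\{ \tilde{\Gamma}^{(e)}_{ijk}(\theta) + \tilde{\Gamma}^{(m)}_{ijk}(\theta) \bigr\} = \frac{1}{2} \bigl\{ \partial_i \tilde{g}_{jk}(\theta) + \partial_j \tilde{g}_{ki}(\theta) - \partial_k \tilde{g}_{ij}(\theta) \bigr\},$$

and

$$Q_{ijkl}(\theta) := E(\partial_i l\, \partial_j l\, \partial_k l\, \partial_l l \mid \theta).$$

Here, $\Gamma^{(e)}_{ijk}$ are the e-connection coefficients, $\Gamma^{(m)}_{ijk}$ are the m-connection coefficients, and
$\Gamma^{(0)}_{ijk}$ are the Riemannian connection coefficients. The relations

$$\partial_i g_{jk} = \Gamma^{(e)}_{ijk} + \Gamma^{(m)}_{ikj} \quad\text{and}\quad \partial_i \tilde{g}_{jk} = \tilde{\Gamma}^{(e)}_{ijk} + \tilde{\Gamma}^{(m)}_{ikj} \tag{3}$$

represent the duality between $\Gamma^{(e)}_{ijk}$ and $\Gamma^{(m)}_{ijk}$ with respect to the metric $g_{ij}$, and the
duality between $\tilde{\Gamma}^{(e)}_{ijk}$ and $\tilde{\Gamma}^{(m)}_{ijk}$ with respect to the metric $\tilde{g}_{ij}$, respectively.

Covariant derivatives $\nabla^{(e)}_i u^j$, $\nabla^{(m)}_i u^j$, and $\nabla^{(0)}_i u^j$ of a vector field $u^j$ with respect
to the connection coefficients $\Gamma^{(e)}_{ijk}$, $\Gamma^{(m)}_{ijk}$, and $\Gamma^{(0)}_{ijk}$ are defined by
$\nabla^{(e)}_i u^j := \partial_i u^j + \sum_k \Gamma^{(e)j}_{ik} u^k$, $\nabla^{(m)}_i u^j := \partial_i u^j + \sum_k \Gamma^{(m)j}_{ik} u^k$, and
$\nabla^{(0)}_i u^j := \partial_i u^j + \sum_k \Gamma^{(0)j}_{ik} u^k$, respectively,
where $\Gamma^{(e)k}_{ij} = \sum_l \Gamma^{(e)}_{ijl}\, g^{kl}$, $\Gamma^{(m)k}_{ij} = \sum_l \Gamma^{(m)}_{ijl}\, g^{kl}$, and $\Gamma^{(0)k}_{ij} = \sum_l \Gamma^{(0)}_{ijl}\, g^{kl}$. In the same way,
the covariant derivatives $\tilde{\nabla}^{(e)}_i$, $\tilde{\nabla}^{(m)}_i$, and $\tilde{\nabla}^{(0)}_i$ with respect to the connection
coefficients $\tilde{\Gamma}^{(e)k}_{ij} = \sum_l \tilde{\Gamma}^{(e)}_{ijl}\, \tilde{g}^{kl}$, $\tilde{\Gamma}^{(m)k}_{ij} = \sum_l \tilde{\Gamma}^{(m)}_{ijl}\, \tilde{g}^{kl}$, and $\tilde{\Gamma}^{(0)k}_{ij} = \sum_l \tilde{\Gamma}^{(0)}_{ijl}\, \tilde{g}^{kl}$ are defined.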

Theorem 1 below is used in the following sections. 


Theorem 1. The difference between the risk functions of Bayesian predictive densities
$p_\pi(y \mid x^N)$ and $p_{\pi'}(y \mid x^N)$ based on priors $\pi(\theta)d\theta$ and $\pi'(\theta)d\theta$, respectively, is given by




F{D{p{y I e),p^{y \ x^) \ 9] - F{D{p{y \ e),p^.{y \ x^) \ 0} 


= 9 II + H 9^j9'''^ 




^,3 




+ J29^J9^’"'^k^K' I +o(l), 


( 4 ) 


^:3 


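where

$$u^i_\pi(\theta) := \sum_k g^{ik}(\theta) \Bigl\{ \partial_k \log \pi(\theta) - \sum_j \Gamma^{(e)j}_{kj}(\theta) \Bigr\} + \frac{1}{2} \sum_{k,l} g^{kl}(\theta) \bigl\{ \tilde{\Gamma}^{(m)i}_{kl}(\theta) - \Gamma^{(m)i}_{kl}(\theta) \bigr\}.$$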


The proof of Theorem 1 is given in the Appendix. 


3 Prior construction based on the predictive metric 

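In this section, we introduce a new metric defined by

$$\mathring{g}_{ij} := \sum_{k=1}^{d} \sum_{l=1}^{d} g_{ik}\, \tilde{g}^{kl}\, g_{jl}, \tag{5}$$

which we call the predictive metric. Since $\mathring{g}_{ij}$ is positive definite, it can be adopted
as a Riemannian metric on $\Theta$. It will be shown that the predictive metric $\mathring{g}_{ij}$, the
corresponding volume element

$$\pi_P(\theta)\,d\theta := |\mathring{g}_{ij}|^{1/2}\, d\theta = \Bigl| \sum_{k=1}^{d} \sum_{l=1}^{d} g_{ik}\, \tilde{g}^{kl}\, g_{jl} \Bigr|^{1/2} d\theta = |g_{ij}|\, |\tilde{g}_{ij}|^{-1/2}\, d\theta, \tag{6}$$

and the Laplacian $\mathring{\Delta}$ based on $\mathring{g}_{ij}$ play essential roles corresponding to those played
by the Fisher-Rao metric $g_{ij}$, the Jeffreys prior $|g_{ij}|^{1/2} d\theta$, and the Laplacian $\Delta$ based
on $g_{ij}$ in the conventional setting where $\tilde{g}_{ij} = g_{ij}$. Here, $|g_{ij}|$, $|\tilde{g}_{ij}|$, and $|\mathring{g}_{ij}|$ denote
determinants of the $d \times d$ matrices $(g_{ij})$, $(\tilde{g}_{ij})$, and $(\mathring{g}_{ij})$, respectively. The $(i,j)$-element of
the inverse of the $d \times d$ matrix $(\mathring{g}_{ij})$ is given by $\mathring{g}^{ij} := \sum_{k,l} g^{ik}\, \tilde{g}_{kl}\, g^{lj}$.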
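
As an illustration of definitions (5) and (6), the following minimal sketch computes the predictive metric for a pair of hypothetical $2 \times 2$ Fisher information matrices and evaluates the density of the volume element. The function names and the numerical matrices are assumptions chosen only for this example; they do not come from the paper.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(p, q):
    """Product of two 2x2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def predictive_metric(g, g_tilde):
    """Predictive metric (5): g * inverse(g_tilde) * g for symmetric g."""
    return matmul2(matmul2(g, inv2(g_tilde)), g)


def volume_density(g, g_tilde):
    """Density of the volume element (6): sqrt(det(predictive metric))."""
    return det2(predictive_metric(g, g_tilde)) ** 0.5


# Hypothetical Fisher information matrices for x and y (illustration only).
g_example = [[2.0, 0.5], [0.5, 1.0]]
g_tilde_example = [[1.0, 0.2], [0.2, 3.0]]

if __name__ == "__main__":
    print(predictive_metric(g_example, g_tilde_example))
    print(volume_density(g_example, g_tilde_example))
```

In this sketch the value returned by `volume_density` can be compared with `det2(g) / det2(g_tilde) ** 0.5`, which is the last expression in (6).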
Here, we give an intuitive meaning of the predictive metric $\mathring{g}_{ij}$ by a nonrigorous
argument. In the standard estimation theory, the Fisher-Rao metric $g_{ij}$, which is the
Fisher information matrix, corresponds to the inverse of the asymptotic variance of the
maximum likelihood estimator. In the setting we consider, the asymptotic variance of
the maximum likelihood estimator based on $x^N$ is $(Ng)^{-1}$, where $g$ is the $d \times d$ matrix
$(g_{ij})$, and the asymptotic variance of the maximum likelihood estimator based on both
$x^N$ and $y$ is $(Ng + \tilde{g})^{-1}$, where $\tilde{g}$ is the $d \times d$ matrix $(\tilde{g}_{ij})$. The inverse of the reduction
of the asymptotic variance by observing $y$ in addition to $x(i)$ $(i = 1, \ldots, N)$ is given
by $\{(Ng)^{-1} - (Ng + \tilde{g})^{-1}\}^{-1} = N^2 g \tilde{g}^{-1} g + O(N)$, as we see in Example 1 in Section 4,
corresponding to the predictive metric $\mathring{g}$.
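
The expansion quoted above can be checked numerically. The sketch below relies on the elementary matrix identity $\{A^{-1} - (A+B)^{-1}\}^{-1} = A B^{-1} A + A$, which with $A = Ng$ and $B = \tilde{g}$ gives $N^2 g \tilde{g}^{-1} g + N g$ exactly; this exact form is a remark added here, not a statement from the paper. The helper names and the matrices are hypothetical.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(p, q):
    """Product of two 2x2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def add2(p, q):
    """Elementwise sum of two 2x2 matrices."""
    return [[p[i][j] + q[i][j] for j in range(2)] for i in range(2)]


def scale2(c, p):
    """Scalar multiple of a 2x2 matrix."""
    return [[c * p[i][j] for j in range(2)] for i in range(2)]


def inverse_variance_reduction(g, g_tilde, n):
    """Inverse of (N g)^{-1} - (N g + g_tilde)^{-1}."""
    ng = scale2(n, g)
    reduction = add2(inv2(ng), scale2(-1.0, inv2(add2(ng, g_tilde))))
    return inv2(reduction)


def predictive_metric_expansion(g, g_tilde, n):
    """N^2 * (g g_tilde^{-1} g) + N * g, the exact value of the inverse."""
    g_ring = matmul2(matmul2(g, inv2(g_tilde)), g)
    return add2(scale2(n * n, g_ring), scale2(n, g))


# Hypothetical Fisher information matrices (illustration only).
g_example = [[2.0, 0.5], [0.5, 1.0]]
g_tilde_example = [[1.0, 0.2], [0.2, 3.0]]

if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, inverse_variance_reduction(g_example, g_tilde_example, n))
        print(n, predictive_metric_expansion(g_example, g_tilde_example, n))
```

The leading term grows like $N^2$ times the predictive metric, while the correction $Ng$ is of order $N$, in agreement with the $O(N)$ remainder in the text.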

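The Riemannian connection coefficients with respect to the predictive metric $\mathring{g}_{ij}$ are
given by

$$\mathring{\Gamma}_{ijk} := \frac{1}{2} \bigl( \partial_i \mathring{g}_{jk} + \partial_j \mathring{g}_{ki} - \partial_k \mathring{g}_{ij} \bigr),$$

and we put $\mathring{\Gamma}^{k}_{ij} := \sum_l \mathring{\Gamma}_{ijl}\, \mathring{g}^{kl}$. Then,

$$\partial_k \log |\mathring{g}_{ij}|^{1/2} = \frac{1}{2} \partial_k \log |\mathring{g}_{ij}| = \sum_i \mathring{\Gamma}^{i}_{ki}. \tag{7}$$

In the same way, we have

$$\partial_k \log |g_{ij}|^{1/2} = \sum_i \Gamma^{(0)i}_{ki} \quad\text{and}\quad \partial_k \log |\tilde{g}_{ij}|^{1/2} = \sum_i \tilde{\Gamma}^{(0)i}_{ki}. \tag{8}$$

Thus,

$$\sum_i \mathring{\Gamma}^{i}_{ki} = \partial_k \log \pi_P = \partial_k \log |g_{ij}| - \frac{1}{2} \partial_k \log |\tilde{g}_{ij}| = 2 \sum_i \Gamma^{(0)i}_{ki} - \sum_i \tilde{\Gamma}^{(0)i}_{ki}. \tag{9}$$

The Laplacian $\mathring{\Delta}$ with respect to the predictive metric $\mathring{g}_{ij}$ is defined by

$$\mathring{\Delta} f := \sum_{i,j} \mathring{\nabla}_i \bigl( \mathring{g}^{ij} \partial_j f \bigr) = \sum_{i,j} \partial_i \bigl( \mathring{g}^{ij} \partial_j f \bigr) + \sum_{i,j,k} \mathring{\Gamma}^{i}_{ik}\, \mathring{g}^{kj} \partial_j f,$$

where $\mathring{\nabla}_i u^j := \partial_i u^j + \sum_k \mathring{\Gamma}^{j}_{ik} u^k$ and $f$ is a real smooth function on $\Theta$.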

By using these quantities, we obtain the following theorem corresponding to (1) in 
the conventional setting. 

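Theorem 2. The difference between the risk functions of Bayesian predictive densities
$p_\pi(y \mid x^N)$ based on a prior $\pi(\theta)d\theta$ and $p_P(y \mid x^N)$ based on $\pi_P(\theta)d\theta$ is given by

$$N^2 \Bigl[ E\{D(\tilde{p}(y \mid \theta), p_\pi(y \mid x^N)) \mid \theta\} - E\{D(\tilde{p}(y \mid \theta), p_P(y \mid x^N)) \mid \theta\} \Bigr]$$
$$= \frac{\mathring{\Delta}(\pi/\pi_P)}{\pi/\pi_P} - \frac{1}{2} \sum_{i,j} \mathring{g}^{ij}\, \partial_i \log\frac{\pi}{\pi_P}\, \partial_j \log\frac{\pi}{\pi_P} + o(1) = 2\, \frac{\mathring{\Delta} (\pi/\pi_P)^{1/2}}{(\pi/\pi_P)^{1/2}} + o(1). \tag{10}$$
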


The proof of Theorem 2 is given in the Appendix. 

If there exists a positive constant $c$ such that $\pi' = c\pi$, we identify the prior $\pi'$ with
$\pi$ because the posterior densities based on them are identical. In fact, the risk difference
(10) between $\pi$ and $\pi_P$ coincides with that between $\pi'$ and $\pi_P$.


Corollary 1. If a positive function $f(\theta)$ is superharmonic with respect to the predictive
metric $\mathring{g}$, i.e. $\mathring{\Delta} f(\theta) \le 0$ for every $\theta \in \Theta$, and the strict inequality holds at a point
in $\Theta$, then the Bayesian predictive density based on the prior density $\{f(\theta)\}^2 \pi_P(\theta)$
asymptotically dominates the Bayesian predictive density $p_P(y \mid x^N)$ based on the prior
density $\pi_P(\theta)$. If there exists a non-constant positive superharmonic function $f(\theta)$ with
respect to the predictive metric $\mathring{g}$, then the Bayesian predictive density based on the
prior density $\{f(\theta)\}^{2c} \pi_P(\theta)$ $(0 < c < 1)$ asymptotically dominates $p_P(y \mid x^N)$.


Proof. The first statement is a straightforward conclusion from Theorem 2. We show
the second statement. The function $\{f(\theta)\}^c$ $(0 < c < 1)$ is superharmonic because
$\mathring{\Delta} f^c = c f^{c-1} \{ \mathring{\Delta} f - (1 - c) f^{-1} \sum_{i,j} \mathring{g}^{ij} \partial_i f\, \partial_j f \} \le 0$ if $f(\theta)$ is a positive superharmonic
function. The strict inequality holds at $\theta$ satisfying $\partial_i f(\theta) \ne 0$ for some $i$. Such $\theta$ exists
since $f(\theta)$ is a non-constant function. Thus, the second statement follows from the first
statement. □
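
The identity for $\mathring{\Delta} f^c$ used in the proof follows from the product rule for the divergence form of the Laplacian. A short worked version, written under the assumption that $\mathring{\Delta} f = \sum_{i,j} \mathring{\nabla}_i(\mathring{g}^{ij}\partial_j f)$ as in Section 3, is:

```latex
\begin{align*}
\mathring{\Delta} f^{c}
  &= \sum_{i,j} \mathring{\nabla}_i \bigl( \mathring{g}^{ij}\, c f^{c-1}\, \partial_j f \bigr) \\
  &= c f^{c-1} \sum_{i,j} \mathring{\nabla}_i \bigl( \mathring{g}^{ij} \partial_j f \bigr)
     + c (c-1) f^{c-2} \sum_{i,j} \mathring{g}^{ij}\, \partial_i f\, \partial_j f \\
  &= c f^{c-1} \Bigl\{ \mathring{\Delta} f
     - (1-c) f^{-1} \sum_{i,j} \mathring{g}^{ij}\, \partial_i f\, \partial_j f \Bigr\}.
\end{align*}
```

Since $\mathring{g}^{ij}$ is positive definite, the second term in the braces is non-negative, and it is strictly positive wherever the gradient of $f$ does not vanish; this is where $0 < c < 1$ and the non-constancy of $f$ enter.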

By setting $c = 1/2$, it follows from Corollary 1 that the Bayesian predictive density
based on the prior $f(\theta)\pi_P(\theta)$ asymptotically dominates the Bayesian predictive density
based on $\pi_P$ if $f(\theta)$ is a non-constant positive superharmonic function.

Note that Corollary 1 also holds if we replace the predictive metric $\mathring{g}$ with another
metric $g'$ satisfying $g' = c\mathring{g}$ with a positive constant $c$. This is because the volume
element with respect to $g'$ is proportional to that with respect to $\mathring{g}$ and the relation
$\Delta' f = (1/c)\mathring{\Delta} f$ holds, where $\Delta'$ is the Laplacian with respect to $g'$.


4 Examples 

In this section, we see three examples. We verify that the results in the previous sections 
are consistent with several known results in Examples 1 and 2 and obtain some new 
results in Examples 2 and 3. 

Example 1. Normal models 

Suppose that $x$ is distributed according to the $d$-dimensional normal distribution
$N_d(\mu, \Sigma)$ with mean vector $\mu$ and covariance matrix $\Sigma = (\Sigma^{ij})$ and that $y$ is distributed
according to the $d$-dimensional normal distribution $N_d(\mu, \tilde{\Sigma})$ with the same mean vector
$\mu$ and possibly different covariance matrix $\tilde{\Sigma} = (\tilde{\Sigma}^{ij})$. Here, $\mu$ is the unknown parameter
and $\Sigma$ and $\tilde{\Sigma}$ are known.

The Fisher information matrix for $p(x \mid \mu)$ is $(g_{ij}) = (\Sigma_{ij})$ and that for $\tilde{p}(y \mid \mu)$ is
$(\tilde{g}_{ij}) = (\tilde{\Sigma}_{ij})$, where $(\Sigma_{ij})$ and $(\tilde{\Sigma}_{ij})$ are the inverse matrices of $(\Sigma^{ij})$ and $(\tilde{\Sigma}^{ij})$, respectively.

Since the coefficients of the predictive metric $\mathring{g}_{ij} = \sum_{k,l} g_{ik}\, \tilde{g}^{kl}\, g_{jl}$ do not depend on
$\mu$, the volume element with respect to the predictive metric is

$$\pi_P(\mu)\,d\mu = |\mathring{g}_{ij}|^{1/2}\, d\mu \propto d\mu,$$

which is the uniform distribution $\pi_U(\mu) \propto 1$.

Kobayashi and Komaki (2008) and George and Xu (2008) considered shrinkage
priors for this model. The Bayesian predictive density $p_\pi(y \mid x^N)$ dominates $p_U(y \mid x^N)$
based on the uniform measure $\pi_U(\mu)$ if $\pi(\mu)$ is a superharmonic function on the
Euclidean space endowed with the metric $\{(Ng)^{-1} - (Ng + \tilde{g})^{-1}\}^{-1}$, see Theorem 3.2
in Kobayashi and Komaki (2008). This result holds for every positive integer $N$.

Since

$$\bigl\{ (Ng)^{-1} - (Ng + \tilde{g})^{-1} \bigr\}^{-1} = (Ng)^{1/2} \Bigl[ I - \bigl\{ I + (Ng)^{-1/2}\, \tilde{g}\, (Ng)^{-1/2} \bigr\}^{-1} \Bigr]^{-1} (Ng)^{1/2}$$
$$= (Ng)^{1/2} \Bigl[ I - I + (Ng)^{-1/2}\, \tilde{g}\, (Ng)^{-1/2} + O(N^{-2}) \Bigr]^{-1} (Ng)^{1/2} = N^2 g\, \tilde{g}^{-1} g + O(N)$$

corresponds to the predictive metric $\mathring{g}$, Theorem 2 is consistent with theoretical and
numerical results in Kobayashi and Komaki (2008) and George and Xu (2008).


Example 2. Location-scale models 

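Suppose that $\phi(x)$ and $\tilde{\phi}(y)$ are probability densities on $\mathbb{R}$ that are symmetric about
the origin. Let

$$p(x \mid \mu, \sigma)\,dx := \frac{1}{\sigma}\, \phi\Bigl( \frac{x - \mu}{\sigma} \Bigr) dx \quad\text{and}\quad \tilde{p}(y \mid \mu, \sigma)\,dy := \frac{1}{\sigma}\, \tilde{\phi}\Bigl( \frac{y - \mu}{\sigma} \Bigr) dy,$$

where $\mu \in \mathbb{R}$ and $\sigma > 0$ are unknown parameters.

Suppose that we have a set of $N$ independent observations $x(1), \ldots, x(N)$ distributed
according to $p(x \mid \mu, \sigma)$. The variable $y$ to be predicted is independently distributed
according to $\tilde{p}(y \mid \mu, \sigma)$. The objective is to construct a prior $\pi$ for a Bayesian predictive
density $p_\pi(y \mid x)$.

The Fisher-Rao metrics on the model manifolds $\{p(x \mid \mu, \sigma)\}$ and $\{\tilde{p}(y \mid \mu, \sigma)\}$ are

$$g_{\mu\mu} = \frac{a}{\sigma^2}, \quad g_{\sigma\sigma} = \frac{b}{\sigma^2}, \quad g_{\mu\sigma} = 0 \quad\text{and}\quad \tilde{g}_{\mu\mu} = \frac{\tilde{a}}{\sigma^2}, \quad \tilde{g}_{\sigma\sigma} = \frac{\tilde{b}}{\sigma^2}, \quad \tilde{g}_{\mu\sigma} = 0,$$

respectively, where $a$ and $b$ are positive constants depending on $\phi(x)$, and $\tilde{a}$ and $\tilde{b}$ are
positive constants depending on $\tilde{\phi}(y)$.

The predictive metric is given by

$$\mathring{g}_{\mu\mu} = \frac{a^2/\tilde{a}}{\sigma^2}, \quad \mathring{g}_{\sigma\sigma} = \frac{b^2/\tilde{b}}{\sigma^2}, \quad\text{and}\quad \mathring{g}_{\mu\sigma} = 0.$$

Define

$$u := \frac{a \sqrt{\tilde{b}}}{b \sqrt{\tilde{a}}}\, \mu, \qquad v := \sigma \tag{11}$$

by rescaling the location parameter $\mu$. We call this coordinate system $(u, v)$ the upper-half
plane coordinates. Then, the predictive metric is represented by

$$\mathring{g}_{uu} = \frac{b^2/\tilde{b}}{v^2}, \quad \mathring{g}_{vv} = \frac{b^2/\tilde{b}}{v^2}, \quad\text{and}\quad \mathring{g}_{uv} = 0,$$

coinciding with the metric on the hyperbolic plane $H^2(-\tilde{b}/b^2)$, which is a 2-dimensional
complete manifold with constant sectional curvature $-\tilde{b}/b^2$. Thus, the model manifold
endowed with the predictive metric $\mathring{g}$ is isometric to $H^2(-\tilde{b}/b^2)$.

The volume element with respect to the predictive metric $\mathring{g}$ is given by

$$\pi_P(\mu, \sigma)\, d\mu\, d\sigma = |\mathring{g}|^{1/2}\, d\mu\, d\sigma \propto \frac{1}{\sigma^2}\, d\mu\, d\sigma$$

and coincides with the Jeffreys priors $|g|^{1/2} d\mu\, d\sigma \propto (1/\sigma^2)\, d\mu\, d\sigma$ for $p(x \mid \mu, \sigma)$ and
$|\tilde{g}|^{1/2} d\mu\, d\sigma \propto (1/\sigma^2)\, d\mu\, d\sigma$ for $\tilde{p}(y \mid \mu, \sigma)$.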

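The Laplacian on the model manifold endowed with the predictive metric $\mathring{g}$ is given
by

$$\mathring{\Delta} = \sigma^2 \Bigl( \frac{\tilde{a}}{a^2} \frac{\partial^2}{\partial \mu^2} + \frac{\tilde{b}}{b^2} \frac{\partial^2}{\partial \sigma^2} \Bigr). \tag{12}$$

By Corollary 1, the Bayesian predictive density $p_R(y \mid x)$ based on the prior

$$\pi_R(\mu, \sigma)\, d\mu\, d\sigma \propto \frac{1}{\sigma}\, d\mu\, d\sigma$$

asymptotically dominates $p_P(y \mid x)$ based on $\pi_P$ because

$$\frac{\pi_R(\mu, \sigma)}{\pi_P(\mu, \sigma)} \propto \frac{1/\sigma}{1/\sigma^2} = \sigma \quad\text{and}\quad \mathring{\Delta} \sigma = \sigma^2 \Bigl( \frac{\tilde{a}}{a^2} \frac{\partial^2}{\partial \mu^2} + \frac{\tilde{b}}{b^2} \frac{\partial^2}{\partial \sigma^2} \Bigr) \sigma = 0.$$

By Theorem 2, the asymptotic risk difference is

$$N^2 \Bigl[ E\{D(\tilde{p}(y \mid \theta), p_R(y \mid x^N)) \mid \theta\} - E\{D(\tilde{p}(y \mid \theta), p_P(y \mid x^N)) \mid \theta\} \Bigr]$$
$$= 2\, \frac{\mathring{\Delta} (\pi_R/\pi_P)^{1/2}}{(\pi_R/\pi_P)^{1/2}} + o(1) = 2\, \frac{\mathring{\Delta} \sigma^{1/2}}{\sigma^{1/2}} + o(1) = -\frac{1}{2} \frac{\tilde{b}}{b^2} + o(1). \tag{13}$$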
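
The constant in (13) can be checked by finite differences. The sketch below assumes that the Laplacian has the diagonal form $\sigma^2\{(\tilde{a}/a^2)\,\partial^2/\partial\mu^2 + (\tilde{b}/b^2)\,\partial^2/\partial\sigma^2\}$ implied by the predictive metric of this example; the function names and the numerical constants are hypothetical and serve only as an illustration.

```python
import math


def predictive_laplacian(f, mu, sigma, a, b, a_t, b_t, h=1e-4):
    """Central-difference approximation of
    sigma^2 * (a_t / a^2 * d^2f/dmu^2 + b_t / b^2 * d^2f/dsigma^2)."""
    f_mm = (f(mu + h, sigma) - 2.0 * f(mu, sigma) + f(mu - h, sigma)) / h**2
    f_ss = (f(mu, sigma + h) - 2.0 * f(mu, sigma) + f(mu, sigma - h)) / h**2
    return sigma**2 * (a_t / a**2 * f_mm + b_t / b**2 * f_ss)


def risk_difference_right_invariant(mu, sigma, a, b, a_t, b_t):
    """Leading term 2 * Laplacian(sqrt(r)) / sqrt(r) with r = pi_R / pi_P = sigma."""
    def root(m, s):
        return math.sqrt(s)
    lap = predictive_laplacian(root, mu, sigma, a, b, a_t, b_t)
    return 2.0 * lap / root(mu, sigma)


if __name__ == "__main__":
    # Hypothetical constants a, b (for x) and a_t, b_t (for y).
    a, b, a_t, b_t = 1.0, 2.0, 0.5, 1.0
    print(risk_difference_right_invariant(0.3, 1.3, a, b, a_t, b_t))
    print(-b_t / (2.0 * b**2))
```

The two printed values should agree up to discretization error, and the first does not depend on the point $(\mu, \sigma)$ at which it is evaluated.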


In fact, it can be shown that the Bayesian predictive density $p_R(y \mid x)$ exactly
dominates $p_P(y \mid x)$ for finite $N$ because $\pi_P$ is the left invariant prior and $\pi_R$ is the
right invariant prior with respect to the location-scale group. The Bayesian procedures
based on the right invariant prior dominate those based on the left invariant prior in
many problems associated with group models as shown in Zidek (1969). The prior $\pi_R$
is also derived as a reference prior, see Berger and Bernardo (1992).

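Furthermore, as we see below, the Bayesian predictive density $p_{c,\kappa}(y \mid x)$ based on
the prior defined by

$$\pi_{c,\kappa}(\mu, \sigma) := \frac{1}{\sigma} \cdot \frac{1}{\dfrac{a^2 \tilde{b}}{b^2 \tilde{a}} \mu^2 + c(\sigma + \kappa)^2 + (1 - c)(\sigma^2 + \kappa^2)} \qquad (0 \le c \le 1,\ 0 < \kappa < \infty) \tag{14}$$

asymptotically dominates $p_R(y \mid x)$ and thus also dominates $p_P(y \mid x)$.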

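To clarify the meaning of the prior $\pi_{c,\kappa}$, we introduce another coordinate system on
the model manifold. Let $(b/\sqrt{\tilde{b}})\rho$ be the Riemannian distance based on the predictive
metric $\mathring{g}$ between a point $P$ and an arbitrary fixed point $O$ on $H^2(-\tilde{b}/b^2)$. The direction
of $P$ from $O$ is represented by a point $\tau$ on the unit circle in the tangent space at $O$.
Then, the point $P$ is represented by $\rho$ and $\tau$, see e.g. Helgason (1984) p. 152. This
coordinate system $(\rho, \tau)$ is called the geodesic polar coordinates. Then, the predictive
metric is given by

$$\mathring{g}_{\rho\rho} = \frac{b^2}{\tilde{b}}, \quad \mathring{g}_{\tau\tau} = \frac{b^2}{\tilde{b}} (\sinh \rho)^2, \quad\text{and}\quad \mathring{g}_{\rho\tau} = 0.$$

The Laplacian is represented by

$$\mathring{\Delta} = \frac{\tilde{b}}{b^2} \Bigl\{ \frac{\partial^2}{\partial \rho^2} + \frac{\cosh \rho}{\sinh \rho} \frac{\partial}{\partial \rho} + \frac{1}{(\sinh \rho)^2} \Delta_S \Bigr\}, \tag{15}$$

where $\Delta_S$ is the Laplacian on the unit circle in the tangent space at $O$, see e.g. Helgason
(1984) p. 158.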

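When the upper-half plane coordinate system is adopted, the Riemannian distance
$(b/\sqrt{\tilde{b}})\rho$ between $(u, v)$ and $(u', v')$ is represented by

$$\cosh \rho = \frac{|u - u'|^2 + v^2 + v'^2}{2 v v'},$$

see e.g. Davies (1989) p. 176. Thus, in the original coordinate system $(\mu, \sigma)$, the
Riemannian distance $(b/\sqrt{\tilde{b}})\rho$ between $(\mu, \sigma)$ and $(0, \kappa)$ is given by

$$\cosh \rho = \frac{\dfrac{a^2 \tilde{b}}{b^2 \tilde{a}} \mu^2 + \sigma^2 + \kappa^2}{2 \sigma \kappa}. \tag{16}$$

Thus, identifying priors up to a positive multiplicative constant, the ratio of prior
densities is given by

$$\frac{\pi_{c,\kappa}(\mu, \sigma)}{\pi_P(\mu, \sigma)} \propto \frac{1}{\dfrac{\frac{a^2 \tilde{b}}{b^2 \tilde{a}} \mu^2 + \sigma^2 + \kappa^2}{2 \sigma \kappa} + c} \tag{17}$$
$$= \frac{1}{\cosh \rho + c}. \tag{18}$$

Note that $\pi_{c,\kappa}(\mu, \sigma)/\pi_P(\mu, \sigma)$ depends on $(\mu, \sigma)$ only through $\rho(\mu, \sigma)$ defined by (16).
Thus, from (15), (18), and Theorem 2, we have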




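$$N^2 \Bigl[ E\{D(\tilde{p}(y \mid \theta), p_{c,\kappa}(y \mid x^N)) \mid \theta\} - E\{D(\tilde{p}(y \mid \theta), p_P(y \mid x^N)) \mid \theta\} \Bigr]$$
$$= 2\, \frac{\mathring{\Delta} (\pi_{c,\kappa}/\pi_P)^{1/2}}{(\pi_{c,\kappa}/\pi_P)^{1/2}} + o(1) = -\frac{\tilde{b}}{b^2} \Bigl\{ \frac{1}{2} + \frac{c}{\cosh \rho + c} + \frac{3}{2} \frac{1 - c^2}{(\cosh \rho + c)^2} \Bigr\} + o(1), \tag{19}$$

and (19) is smaller than (13) when $0 \le c \le 1$ and $0 < \kappa < \infty$. The asymptotic risk
difference (19) can also be derived from (17) and the Laplacian (12) in the original
coordinate system.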
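
The closed form in (19) can be compared with a direct numerical evaluation of $2\mathring{\Delta} h^{1/2}/h^{1/2}$ for $h = 1/(\cosh\rho + c)$. The sketch below assumes the radial part of the Laplacian in geodesic polar coordinates, $(\tilde{b}/b^2)\{\partial^2/\partial\rho^2 + (\cosh\rho/\sinh\rho)\,\partial/\partial\rho\}$; the angular part drops out because $h$ depends on $\rho$ only. The function names and the parameter `ratio` (standing for $\tilde{b}/b^2$) are hypothetical.

```python
import math


def risk_difference_numeric(rho, c, ratio=1.0, h=1e-4):
    """2 * Laplacian(sqrt(r)) / sqrt(r) for r = 1 / (cosh(rho) + c),
    using central differences for the radial derivatives."""
    def root(r):
        return (1.0 / (math.cosh(r) + c)) ** 0.5
    d1 = (root(rho + h) - root(rho - h)) / (2.0 * h)
    d2 = (root(rho + h) - 2.0 * root(rho) + root(rho - h)) / h**2
    lap = ratio * (d2 + math.cosh(rho) / math.sinh(rho) * d1)
    return 2.0 * lap / root(rho)


def risk_difference_closed_form(rho, c, ratio=1.0):
    """-ratio * {1/2 + c*w + (3/2)*(1 - c^2)*w^2} with w = 1 / (cosh(rho) + c)."""
    w = 1.0 / (math.cosh(rho) + c)
    return -ratio * (0.5 + c * w + 1.5 * (1.0 - c * c) * w * w)


if __name__ == "__main__":
    for c in (0.0, 0.5, 1.0):
        for rho in (0.5, 1.0, 2.0):
            print(c, rho,
                  risk_difference_numeric(rho, c),
                  risk_difference_closed_form(rho, c))
```

For large $\rho$ both expressions approach $-\tfrac{1}{2}(\tilde{b}/b^2)$, the value attained by the right invariant prior in (13).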

By Corollary 1, the Bayesian predictive density $p_{c,\kappa}(y \mid x^N)$ $(0 \le c < \infty,\ 0 < \kappa <
\infty)$ asymptotically dominates $p_P(y \mid x^N)$ since the function (18) is superharmonic for
$0 \le c < \infty$. However, $p_{c,\kappa}(y \mid x^N)$ asymptotically dominates $p_R(y \mid x^N)$ only when
$0 \le c \le 1$.

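Several properties of the function (14) are discussed in Komaki (2007). As $\kappa \to \infty$,
the prior $\pi_{c,\kappa}$ converges to the right invariant prior $\pi_R$, because

$$\frac{\pi_{c,\kappa}(\mu, \sigma)}{\pi_P(\mu, \sigma)} \propto \frac{\kappa/2}{\dfrac{\frac{a^2 \tilde{b}}{b^2 \tilde{a}} \mu^2 + \sigma^2 + \kappa^2}{2 \sigma \kappa} + c} \to \sigma$$

when $\kappa \to \infty$. Here, priors are identified up to a positive multiplicative constant. As
$\kappa \to 0$, the prior converges to

$$\pi_C(\mu, \sigma)\, d\mu\, d\sigma := \frac{1}{\dfrac{a^2 \tilde{b}}{b^2 \tilde{a}} \mu^2 + \sigma^2} \cdot \frac{1}{\sigma}\, d\mu\, d\sigma = \frac{1}{1 + \dfrac{a^2 \tilde{b}}{b^2 \tilde{a}} \Bigl( \dfrac{\mu}{\sigma} \Bigr)^2} \cdot \frac{1}{\sigma^3}\, d\mu\, d\sigma,$$

because

$$\frac{\pi_{c,\kappa}(\mu, \sigma)}{\pi_P(\mu, \sigma)} \propto \frac{1/(2\kappa)}{\dfrac{\frac{a^2 \tilde{b}}{b^2 \tilde{a}} \mu^2 + \sigma^2 + \kappa^2}{2 \sigma \kappa} + c} \to \frac{\sigma}{\dfrac{a^2 \tilde{b}}{b^2 \tilde{a}} \mu^2 + \sigma^2}$$

when $\kappa \to 0$. The prior density with respect to the rescaled parameter $(u, v)$ defined by
(11) is given by

$$\pi_C(\mu, \sigma)\, d\mu\, d\sigma \propto \frac{1}{1 + (u/v)^2} \cdot \frac{1}{v^3}\, du\, dv. \tag{20}$$

Note that the Cauchy prior for $u$, discussed by Jeffreys and many researchers, appears
in (20). Thus, the class $\pi_{c,\kappa}$ of priors bridges the right invariant prior $\pi_R$, coinciding
with the reference prior, and the Cauchy prior $\pi_C$.

Figure 1: The asymptotic risk difference $N^2 [E\{D(\tilde{p}(y \mid \theta), p_{c,\kappa}(y \mid x^N)) \mid \theta\} - E\{D(\tilde{p}(y \mid
\theta), p_P(y \mid x^N)) \mid \theta\}] + o(1) = -(\tilde{b}/b^2)\{1/2 + c(\pi/\pi_P) + (3/2)(1 - c^2)(\pi/\pi_P)^2\}$, plotted against $\rho$, for Bayesian
predictive densities based on $\pi_P$, $\pi_R$, $\pi_C$, $\pi_{\kappa=1,c=0}$, and $\pi_{\kappa=1,c=1}$. We put $\tilde{b}/b^2 = 1$ just
for simplicity.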

Figure 1 illustrates the difference between the risk functions of Bayesian predictive
densities based on $\pi_R$, $\pi_C$, $\pi_{\kappa=1,c=0}$, and $\pi_{\kappa=1,c=1}$ and the risk function of $p_P(y \mid x^N)$.
The risk functions of the right invariant prior $\pi_R$ and the Cauchy prior $\pi_C$ are uniformly
smaller than that of $\pi_P$. The asymptotic risk of the Cauchy prior $\pi_C$ coincides with that
of $\pi_R$. Furthermore, the asymptotic risks of $\pi_{\kappa=1,c=0}$ and $\pi_{\kappa=1,c=1}$ are smaller than that
of $\pi_R$ for every $\rho$. Therefore, the use of $\pi_{c,\kappa}$ $(0 \le c \le 1)$ is recommended. The
risk of $\pi_{\kappa=1,c=0}$ is smaller than that of $\pi_{\kappa=1,c=1}$ when $\rho$ is small, and vice versa. Thus,
there does not exist a unique best value of $c$. The choice of the value of $0 < \kappa < \infty$
is arbitrary because it corresponds to the center of shrinkage. Finite-sample decision
theoretic properties such as admissibility of Bayesian predictive densities $p_{c,\kappa}(y \mid x^N)$
based on the proposed priors $\pi_{c,\kappa}$ $(0 < \kappa < \infty,\ 0 \le c \le 1)$ require further research.
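Example 3. Poisson models

Suppose that $x_i$ $(i = 1, \ldots, d)$ are independently distributed according to the Poisson
distribution $\mathrm{Po}(\lambda_i)$ with mean $\lambda_i$ and that $y_i$ $(i = 1, \ldots, d)$ are independently distributed
according to the Poisson distribution $\mathrm{Po}(s_i \lambda_i)$ with mean $s_i \lambda_i$. Here, $s_i$ are known
positive constants. The unknown parameter is $\theta = \lambda := (\lambda_1, \ldots, \lambda_d)$. The
objective is to construct a predictive density for $y$ by using $x$. This problem in the
conventional setting, in which $s_1 = s_2 = \cdots = s_d$, is studied in Komaki (2004).

If $s_i \ll 1$ for each $i$, then this prediction problem is in the asymptotic setting. The
Fisher-Rao metrics corresponding to $x$ and $y$ are given by

$$g_{ij} = \begin{cases} 1/\lambda_i & (i = j) \\ 0 & (i \ne j) \end{cases} \quad\text{and}\quad \tilde{g}_{ij} = \begin{cases} s_i/\lambda_i & (i = j) \\ 0 & (i \ne j), \end{cases}$$

respectively. The predictive metric is

$$\mathring{g}_{ij} = \begin{cases} \dfrac{1}{s_i \lambda_i} & (i = j) \\ 0 & (i \ne j) \end{cases} \tag{21}$$

and the corresponding volume element is

$$\pi_P(\lambda)\, d\lambda := |\mathring{g}|^{1/2}\, d\lambda = \prod_{i=1}^{d} \frac{1}{(s_i \lambda_i)^{1/2}}\, d\lambda \propto \frac{1}{(\lambda_1 \cdots \lambda_d)^{1/2}}\, d\lambda,$$

coinciding with the Jeffreys priors for $p(x \mid \lambda)$ and $\tilde{p}(y \mid \lambda)$.

The Laplacian $\mathring{\Delta}$ based on the predictive metric $\mathring{g}$ is given by

$$\mathring{\Delta} f = \sum_{i=1}^{d} s_i \Bigl( \lambda_i \frac{\partial^2 f}{\partial \lambda_i^2} + \frac{1}{2} \frac{\partial f}{\partial \lambda_i} \Bigr),$$

where $f$ is a smooth real function of $\lambda$.
Define

$$\pi_S(\lambda)\, d\lambda := \frac{(\lambda_1/s_1 + \cdots + \lambda_d/s_d)^{-(d/2-1)}}{(\lambda_1 \cdots \lambda_d)^{1/2}}\, d\lambda \propto (\lambda_1/s_1 + \cdots + \lambda_d/s_d)^{-(d/2-1)}\, \pi_P(\lambda)\, d\lambda.$$
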


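Then, identifying priors up to a positive multiplicative constant, from

$$\frac{\partial}{\partial \lambda_i} \frac{\pi_S(\lambda)}{\pi_P(\lambda)} = -\Bigl( \frac{d}{2} - 1 \Bigr) \Bigl( \frac{\lambda_1}{s_1} + \cdots + \frac{\lambda_d}{s_d} \Bigr)^{-d/2} \frac{1}{s_i}$$

and

$$\frac{\partial^2}{\partial \lambda_i^2} \frac{\pi_S(\lambda)}{\pi_P(\lambda)} = \frac{d}{2} \Bigl( \frac{d}{2} - 1 \Bigr) \Bigl( \frac{\lambda_1}{s_1} + \cdots + \frac{\lambda_d}{s_d} \Bigr)^{-d/2-1} \frac{1}{s_i^2}, \tag{22}$$

we have

$$\mathring{\Delta} \frac{\pi_S(\lambda)}{\pi_P(\lambda)} = \sum_{i=1}^{d} s_i \Bigl\{ \lambda_i \frac{\partial^2}{\partial \lambda_i^2} \frac{\pi_S(\lambda)}{\pi_P(\lambda)} + \frac{1}{2} \frac{\partial}{\partial \lambda_i} \frac{\pi_S(\lambda)}{\pi_P(\lambda)} \Bigr\} = 0. \tag{23}$$

Since $\pi_S/\pi_P$ is a non-constant positive superharmonic function of $\lambda$, the Bayesian
predictive density $p_S(y \mid x)$ based on $\pi_S$ asymptotically dominates $p_P(y \mid x)$ by Corollary 1.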
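
The harmonicity stated in (23) can be checked numerically. The sketch below assumes the Laplacian $\mathring{\Delta} f = \sum_i s_i(\lambda_i\,\partial^2 f/\partial\lambda_i^2 + \tfrac{1}{2}\,\partial f/\partial\lambda_i)$ obtained from the diagonal predictive metric $1/(s_i\lambda_i)$ and the ratio $\pi_S/\pi_P \propto (\sum_i \lambda_i/s_i)^{-(d/2-1)}$; the function names and the numerical values of $\lambda$ and $s$ are hypothetical.

```python
def prior_ratio(lam, s):
    """pi_S / pi_P up to a constant: (sum_i lam_i / s_i) ** (-(d/2 - 1))."""
    d = len(lam)
    total = sum(l / si for l, si in zip(lam, s))
    return total ** (-(d / 2.0 - 1.0))


def partials(f, lam, i, h=1e-4):
    """Central-difference first and second derivatives of f in coordinate i."""
    up = list(lam)
    dn = list(lam)
    up[i] += h
    dn[i] -= h
    d1 = (f(up) - f(dn)) / (2.0 * h)
    d2 = (f(up) - 2.0 * f(lam) + f(dn)) / h**2
    return d1, d2


def poisson_laplacian(f, lam, s):
    """sum_i s_i * (lam_i * d^2f/dlam_i^2 + 0.5 * df/dlam_i)."""
    total = 0.0
    for i in range(len(lam)):
        d1, d2 = partials(f, lam, i)
        total += s[i] * (lam[i] * d2 + 0.5 * d1)
    return total


def risk_difference(lam, s):
    """Laplacian(r)/r - 0.5 * sum_i s_i * lam_i * (dr/dlam_i)^2 / r^2 for r = pi_S/pi_P."""
    def f(v):
        return prior_ratio(v, s)
    r = f(lam)
    grad_term = sum(s[i] * lam[i] * partials(f, lam, i)[0] ** 2
                    for i in range(len(lam)))
    return poisson_laplacian(f, lam, s) / r - 0.5 * grad_term / r**2


if __name__ == "__main__":
    lam = [1.0, 2.0, 0.5, 1.5]   # hypothetical Poisson means
    s = [0.5, 1.0, 2.0, 1.5]     # hypothetical known constants s_i
    print(poisson_laplacian(lambda v: prior_ratio(v, s), lam, s))
    print(risk_difference(lam, s))
```

With these inputs the first printed value should be close to zero, and the second close to $-\tfrac{1}{2}(d/2-1)^2(\sum_i \lambda_i/s_i)^{-1}$, the leading term obtained by inserting (22) and (23) into Theorem 2.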

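The model manifold endowed with the predictive metric is isometric to the first
orthant $\mathbb{R}^d_+ = \{(x^1, \ldots, x^d) : x^1 > 0, x^2 > 0, \ldots, x^d > 0\}$ of the Euclidean space $\mathbb{R}^d$,
as we see below. Define

$$\xi^{i'} := 2 \sqrt{\frac{\lambda_{i'}}{s_{i'}}}.$$

Then,

$$\frac{\partial \xi^{i'}}{\partial \lambda_i} = \begin{cases} \dfrac{1}{\sqrt{s_i \lambda_i}} & (i = i') \\ 0 & (i \ne i'). \end{cases}$$

Thus, from (21), the coefficients of the metric with respect to $(\xi^{i'})$ are given by

$$\mathring{g}_{i'j'} = \sum_{i,j} \mathring{g}_{ij} \frac{\partial \lambda_i}{\partial \xi^{i'}} \frac{\partial \lambda_j}{\partial \xi^{j'}} = \begin{cases} 1 & (i' = j') \\ 0 & (i' \ne j'). \end{cases}$$

This coincides with the usual metric on $\mathbb{R}^d_+$.

Here, the function

$$\|\xi\|^{-d+2} \propto \frac{\pi_S(\lambda)}{\pi_P(\lambda)} \propto \Bigl( \frac{\lambda_1}{s_1} + \cdots + \frac{\lambda_d}{s_d} \Bigr)^{-d/2+1}$$

of $\xi$ is the Green function of the heat equation on $\mathbb{R}^d$ and plays an essential role in
Bayesian methods for model manifolds isometric to the Euclidean space. For example,
the prior density $\|\mu\|^{-d+2}$ for the $d$-dimensional normal model $N_d(\mu, I_d)$, where $\mu$ is
the $d$-dimensional unknown mean vector and $I_d$ is the $d \times d$ identity matrix, is known
as the Stein prior.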


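The Bayesian predictive density based on $\pi_P$ is

$$p_P(y \mid x) = \frac{\displaystyle \int \prod_{i=1}^{d} \Bigl\{ \frac{(s_i \lambda_i)^{y_i}}{y_i!} e^{-s_i \lambda_i} \frac{\lambda_i^{x_i}}{x_i!} e^{-\lambda_i} \Bigr\} \prod_{i=1}^{d} \lambda_i^{-1/2}\, d\lambda}{\displaystyle \int \prod_{i=1}^{d} \Bigl\{ \frac{\lambda_i^{x_i}}{x_i!} e^{-\lambda_i} \Bigr\} \prod_{i=1}^{d} \lambda_i^{-1/2}\, d\lambda}$$
$$= \frac{\displaystyle \prod_{i=1}^{d} \frac{\Gamma(x_i + y_i + 1/2)}{(1 + s_i)^{x_i + y_i + 1/2}} \frac{s_i^{y_i}}{x_i!\, y_i!}}{\displaystyle \prod_{i=1}^{d} \frac{\Gamma(x_i + 1/2)}{x_i!}} = \prod_{i=1}^{d} \frac{s_i^{y_i}\, \Gamma(x_i + y_i + 1/2)}{(1 + s_i)^{x_i + y_i + 1/2}\, \Gamma(x_i + 1/2)\, y_i!},$$

where $d\lambda := d\lambda_1 \cdots d\lambda_d$.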
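
The closed form of $p_P(y \mid x)$ is a product of negative-binomial-type terms, one per coordinate, and can be evaluated stably on the log scale. The following minimal sketch assumes the formula $s_i^{y_i}\Gamma(x_i+y_i+1/2)/\{(1+s_i)^{x_i+y_i+1/2}\Gamma(x_i+1/2)\,y_i!\}$ for each factor; the function names and the numerical inputs are hypothetical.

```python
import math


def log_predictive_component(y, x, s):
    """Log of one factor of p_P(y | x):
    s^y * Gamma(x + y + 1/2) / ((1 + s)^(x + y + 1/2) * Gamma(x + 1/2) * y!)."""
    return (y * math.log(s)
            + math.lgamma(x + y + 0.5)
            - (x + y + 0.5) * math.log1p(s)
            - math.lgamma(x + 0.5)
            - math.lgamma(y + 1.0))


def predictive_pmf(y_vec, x_vec, s_vec):
    """p_P(y | x) as a product over independent coordinates."""
    total = sum(log_predictive_component(y, x, s)
                for y, x, s in zip(y_vec, x_vec, s_vec))
    return math.exp(total)


if __name__ == "__main__":
    # Hypothetical observed counts and known constants s_i.
    x_obs = [3, 0, 5]
    s_known = [0.5, 0.2, 1.0]
    print(predictive_pmf([1, 0, 4], x_obs, s_known))
    # Each one-dimensional factor should sum to one over y = 0, 1, 2, ...
    print(sum(math.exp(log_predictive_component(y, 3, 0.5)) for y in range(400)))
```

Working with `math.lgamma` avoids overflow of the gamma function for large counts; the final line is a sanity check that each factor is a proper probability mass function.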

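The Bayesian predictive density based on $\pi_S$ is

$$p_S(y \mid x) = \frac{\displaystyle \int \prod_{i=1}^{d} \Bigl\{ \frac{(s_i \lambda_i)^{y_i}}{y_i!} e^{-s_i \lambda_i} \frac{\lambda_i^{x_i}}{x_i!} e^{-\lambda_i} \Bigr\} \Bigl( \sum_{j=1}^{d} \frac{\lambda_j}{s_j} \Bigr)^{-(d/2-1)} \prod_{k=1}^{d} \lambda_k^{-1/2}\, d\lambda}{\displaystyle \int \prod_{i=1}^{d} \Bigl\{ \frac{\lambda_i^{x_i}}{x_i!} e^{-\lambda_i} \Bigr\} \Bigl( \sum_{j=1}^{d} \frac{\lambda_j}{s_j} \Bigr)^{-(d/2-1)} \prod_{k=1}^{d} \lambda_k^{-1/2}\, d\lambda}$$
$$= \frac{\displaystyle \int \prod_{i=1}^{d} \Bigl\{ \frac{s_i^{y_i}}{y_i!} \lambda_i^{x_i + y_i - 1/2} e^{-(1 + s_i) \lambda_i} \Bigr\} \Bigl\{ \int_0^\infty u^{d/2-2} \exp\Bigl( -u \sum_{j=1}^{d} \frac{\lambda_j}{s_j} \Bigr) du \Bigr\} d\lambda}{\displaystyle \int \prod_{i=1}^{d} \Bigl\{ \lambda_i^{x_i - 1/2} e^{-\lambda_i} \Bigr\} \Bigl\{ \int_0^\infty u^{d/2-2} \exp\Bigl( -u \sum_{j=1}^{d} \frac{\lambda_j}{s_j} \Bigr) du \Bigr\} d\lambda}$$
$$= \frac{\displaystyle \int_0^\infty u^{d/2-2} \prod_{i=1}^{d} (1 + s_i + u/s_i)^{-(x_i + y_i + 1/2)}\, du}{\displaystyle \int_0^\infty u^{d/2-2} \prod_{i=1}^{d} (1 + u/s_i)^{-(x_i + 1/2)}\, du} \prod_{i=1}^{d} \frac{s_i^{y_i}\, \Gamma(x_i + y_i + 1/2)}{y_i!\, \Gamma(x_i + 1/2)}.$$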

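We have the asymptotic risk difference

$$E\{D(\tilde{p}(y \mid \theta), p_S(y \mid x)) \mid \theta\} - E\{D(\tilde{p}(y \mid \theta), p_P(y \mid x)) \mid \theta\}$$
$$= \frac{\mathring{\Delta}(\pi_S/\pi_P)}{\pi_S/\pi_P} - \frac{1}{2} \sum_{i,j} \mathring{g}^{ij} \frac{\partial_i(\pi_S/\pi_P)\, \partial_j(\pi_S/\pi_P)}{(\pi_S/\pi_P)^2} + o(1) = -\frac{1}{2} \Bigl( \frac{d}{2} - 1 \Bigr)^2 \Bigl( \frac{\lambda_1}{s_1} + \cdots + \frac{\lambda_d}{s_d} \Bigr)^{-1} + o(1) \tag{24}$$

by Theorem 2, (22), (23), and

$$\mathring{g}^{ij} = \begin{cases} s_i \lambda_i & (i = j) \\ 0 & (i \ne j). \end{cases}$$

The asymptotic risk difference (24) depends on $\lambda$ only through $\lambda_1/s_1 + \cdots + \lambda_d/s_d$.
When $\lambda_1/s_1 + \cdots + \lambda_d/s_d$ is small the improvement is large, and it converges to zero
as $\lambda_1/s_1 + \cdots + \lambda_d/s_d$ goes to infinity.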

It can be shown that $\pi_S$ dominates $\pi_P$ in the sense of infinitesimal prediction, and we
can construct a Bayesian predictive density dominating $p_P(y \mid x)$ for arbitrary $s_i > 0$
$(i = 1, \ldots, d)$ by modifying the prior $\pi_S$. Finite sample properties of this prior will
be discussed in another paper by using an approach different from the asymptotic
methods in the present paper.

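In Examples 1, 2, and 3, the volume element $|\mathring{g}_{ij}|^{1/2}$ based on the predictive metric
coincides with the Jeffreys priors based on $g_{ij}$ and $\tilde{g}_{ij}$, i.e. $|\mathring{g}_{ij}(\theta)|^{1/2} \propto |g_{ij}(\theta)|^{1/2} \propto
|\tilde{g}_{ij}(\theta)|^{1/2}$, although the three metrics are different. In general, if two metrics $g_{ij}$ and
$\tilde{g}_{ij}$ satisfy the relation

$$\tilde{g}_{ij} = \sum_{k,l} A_i^k\, g_{kl}\, A_j^l, \tag{25}$$

where $(A_i^j)$ is a $d \times d$ regular matrix not depending on $\theta$, then

$$|\tilde{g}_{ij}|^{1/2} = |A_i^j|\, |g_{ij}|^{1/2} \quad\text{and}\quad |\mathring{g}_{ij}|^{1/2} = |g_{ij}|\, |\tilde{g}_{ij}|^{-1/2} = |A_i^j|^{-1} |g_{ij}|^{1/2},$$

and the volume elements based on $g_{ij}$, $\tilde{g}_{ij}$, and $\mathring{g}_{ij}$ are proportional to each other. The
relation (25) appears in many examples as in Examples 1, 2, and 3.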


Appendix. Proofs of Theorems 1 and 2 


First, we prepare a preliminary result, Theorem Al, to prove Theorem 1. 

Asymptotic properties of predictive densities in the conventional setting in which 
x{i), i = l,...,N, and y have the same distribution have been studied, see Komaki 
(1996), Hartigan (1998), and Sweeting et al. (2006). 

Fushiki et al. (2004) generalized these results for the setting in which $x(i)$, $i =
1, \ldots, N$, and $y$ have different distributions. The Bayesian predictive density is expanded
as


P^{y I = p{y I ^mle) + | didjp{y I 0mle) “ I 0mle) 




2N 


E 






+ 2^g*'=(0„,le) { ft log7r(0„,ie) 


dkPiy \ 0mle) + Op{N ^), 


(26) 


where 0niie is the maximum likelihood estimator, and ft := 9/90*. The estimator 
minimizing the Bayes risk jFi[D{p{y \ 9),pT^{y \ a:)}|0]7r(0)d0 is given by 




e)+Op(A ) 


(27) 
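where

$$u^i_\pi(\theta) := \sum_k g^{ik}(\theta) \Bigl\{ \partial_k \log \pi(\theta) - \sum_j \Gamma^{(e)j}_{kj}(\theta) \Bigr\} + \frac{1}{2} \sum_{k,l} g^{kl}(\theta) \bigl\{ \tilde{\Gamma}^{(m)i}_{kl}(\theta) - \Gamma^{(m)i}_{kl}(\theta) \bigr\}, \tag{28}$$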

which is a covariant vector. 

The expansion of the risk function of a Bayesian predictive density $p_\pi(y \mid x^N)$ up to
the order $N^{-2}$ is given in Theorem A1 below. The expansion is invariant in the sense
that each term is a scalar not depending on parametrization. In Theorem A1, we put

6) ■.=d,dj \ogp{x I 9) + g,j{e) - ^ vf^^{e)dk logp(a; | 6»), 

k 

vlf{y;d) ■= p(^y\ g) [didjPiy \ 6) | 6»)|, 

^ and := g^^g’^K 

l,m,n j,l 

Here, and are vectors orthogonal to the model manifolds {p{x \ 9)\6 G 0)} 
and {p{y \ 9)\9 G 0)}, respectively. These vectors are closely related to the curvature 
of the manifolds. 


Theorem A1. The expected Kullback-Leibler divergence from the true density $\tilde{p}(y \mid \theta)$
to the Bayesian predictive density $p_\pi(y \mid x^N)$ based on a prior $\pi(\theta)$ is expanded as

F^D{p{y I 9),p^{y \ x^)) | 6»| 


1 


i,j,k 


Y E I 9 ) g,rg. 


i,j,k,l 
2 A 2


E E 1 0 ) g^^g^^ - ^ E 
i,pk


+ ill E o««9«s“ - E E I«) 9‘V‘ 


i,j,k,l 

1 - y E 

^ V 
0 ) g^^g^'


^2 kl ^kl )\^ mn ^mn JU U 
In


i,j,k,l,m 


(29) 
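where

$$u^i_\pi(\theta) := \sum_k g^{ik}(\theta) \Bigl\{ \partial_k \log \pi(\theta) - \sum_j \Gamma^{(e)j}_{kj}(\theta) \Bigr\} + \frac{1}{2} \sum_{k,l} g^{kl}(\theta) \bigl\{ \tilde{\Gamma}^{(m)i}_{kl}(\theta) - \Gamma^{(m)i}_{kl}(\theta) \bigr\}.$$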

Outline of the Proof. Expansions of the risk functions corresponding to (29) when the
distributions of $x(i)$, $i = 1, \ldots, N$, and $y$ are the same are obtained by Komaki (1996) for
curved exponential families by using differential geometrical notions and by Hartigan
(1998) for general models under rigorous regularity conditions. Fushiki et al. (2004)
obtained several related results when the distributions of $x(i)$, $i = 1, \ldots, N$, and
$y$ are different. The expansion (29) is shown by lengthy calculations parallel to those
in Komaki (1996) and Hartigan (1998) by using results such as (26), (27), and (28)
obtained by Fushiki et al. (2004). □


The quantity ^ ^ j ki^ y'j -g Efron curvature (Efron, 1975) of the 


1^) 9^^ 9 


ij ki jg mixture 


model manifold {p(x | 0) | 0 S 0} at 0, and J2i j ki^ 
mean curvature discussed in Komaki (1996) of the model manifold {p{y \ d)\6 € 0} 
at 6. 


Proof of Theorem 1. The desired result is obvious from Theorem Al because (29) has 
the form 

p[D{p{y\e),p^{y\x^))\e] 

i,0 i,j,k 

+ terms independent of tt + o{N~'^). (30) 

□ 

To derive Theorem 1, it is sufficient to show (30). Much less calculation is needed 
to verify (30) than to obtain all the explicit terms in (29). 


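Proof of Theorem 2. Let $f(\theta) := \pi(\theta)/\pi_P(\theta)$. Since

$$\frac{1}{2} \sum_{i,j} \mathring{g}^{ij}\, \partial_i \log f\, \partial_j \log f + \mathring{\Delta} \log f = \frac{\mathring{\Delta}(\pi/\pi_P)}{\pi/\pi_P} - \frac{1}{2} \sum_{i,j} \mathring{g}^{ij}\, \partial_i \log \frac{\pi}{\pi_P}\, \partial_j \log \frac{\pi}{\pi_P} = 2\, \frac{\mathring{\Delta} (\pi/\pi_P)^{1/2}}{(\pi/\pi_P)^{1/2}}, \tag{31}$$

it is sufficient to show that the left-hand side of (10) is equal to
$(1/2) \sum_{i,j} \mathring{g}^{ij}\, \partial_i \log f\, \partial_j \log f + \mathring{\Delta} \log f$ up to $o(1)$.
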
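From (6) and (7), we have

$$\partial_i \log \pi = \partial_i \log f + \partial_i \log \pi_P = \partial_i \log f + \sum_j \mathring{\Gamma}^{j}_{ij}.$$

Let

$$r^i := \frac{1}{2} \sum_{j,k} g^{jk} \bigl( \tilde{\Gamma}^{(m)i}_{jk} - \Gamma^{(m)i}_{jk} \bigr) \quad\text{and}\quad s^i := \sum_{j,k} g^{ik} \bigl( \mathring{\Gamma}^{j}_{kj} - \Gamma^{(e)j}_{kj} \bigr).$$

Then, from (28),

$$u^i_\pi = \sum_k g^{ik} \Bigl( \partial_k \log \pi - \sum_j \Gamma^{(e)j}_{kj} \Bigr) + r^i = \sum_k g^{ik}\, \partial_k \log f + \sum_{j,k} g^{ik} \bigl( \mathring{\Gamma}^{j}_{kj} - \Gamma^{(e)j}_{kj} \bigr) + r^i = \sum_k g^{ik}\, \partial_k \log f + s^i + r^i. \tag{32}$$

Thus, when $\pi = \pi_P$, $u^i_P = s^i + r^i$. From (4), we have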

N'^(E[D{p{y I 6 »),p^( 2 / | x^)] -E[D{p{y \ e),pp{y \ x^)]^ 


= 2 XI ( dkul + Y^. 
+ X 9ij9^’" { X log/) + X log / > + 0(1)
= 2 X 9, log fdj \ogf + Y 9ij (5fc log /) (s^ + )

i,j,k 

+ Y 9^j9'^M9^'di log /) + ^ If log / + o(l). 

i,j,k,l i,j,k,l,m 

Let Li := 9^ log/. From (31), it is sufficient to show that 

Ykj9"^Lk{s^ +rn+ Y X 

i,j,k i,j,k,l j,k,l,m 


(33) 


(34) 
is equal to A log / = Since

0 = 34 = d.(J^g‘"^g^J = J^(d4nffmn 

mm m 

we have 

Thus, from 

Y,d,{4L,) = Y. d.ig^'^g^^gM 

i^j i,j,k-,l 

= E i9^g44hiLj + E + E 9^'^hA{9^‘L,), 
Hence, because of the duality (3) of the e-connection and the m-connection, (34) is equal
to 